Multiword Expression Filtering for Building Knowledge Maps
Abstract
This paper describes an algorithm for improving the quality of multiword expressions extracted from documents. We measure the quality of a multiword expression by its “usefulness” in helping ontologists build knowledge maps that allow users to search a large document corpus. Our stopword-based algorithm takes n-grams extracted from documents and cleans them up to make them more suitable for building knowledge maps. Running the algorithm on large document corpora shows that it increases the percentage of useful terms from 40% to 70%, with an eight-fold improvement observed in some cases.
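The abstract does not spell out the cleaning rules, so the following is only a minimal sketch of one plausible stopword-based cleanup, not the paper's actual algorithm. It assumes that cleaning means stripping stopwords from both ends of each n-gram and discarding whatever is no longer a multiword term. The stopword list and the names `STOPWORDS`, `trim_stopwords`, `filter_ngrams`, and `min_words` are hypothetical and chosen for illustration.

```python
# Illustrative stopword list; an assumption, not the list used in the paper.
STOPWORDS = {"a", "an", "and", "the", "of", "in", "on", "for", "to", "with", "is", "are"}


def trim_stopwords(ngram):
    """Strip leading and trailing stopwords from a whitespace-tokenized n-gram."""
    tokens = ngram.lower().split()
    start, end = 0, len(tokens)
    # Advance past stopwords at the front of the n-gram.
    while start < end and tokens[start] in STOPWORDS:
        start += 1
    # Step back over stopwords at the end of the n-gram.
    while end > start and tokens[end - 1] in STOPWORDS:
        end -= 1
    return " ".join(tokens[start:end])


def filter_ngrams(ngrams, min_words=2):
    """Trim each n-gram, then keep unique results that are still multiword."""
    seen = set()
    kept = []
    for ngram in ngrams:
        cleaned = trim_stopwords(ngram)
        # Drop candidates that became empty or single-word, and duplicates.
        if len(cleaned.split()) >= min_words and cleaned not in seen:
            seen.add(cleaned)
            kept.append(cleaned)
    return kept


candidates = ["the knowledge map of", "of the", "knowledge map", "in a large document corpus"]
print(filter_ngrams(candidates))
```

Interior stopwords are left in place in this sketch, so a term such as "part of speech" would survive intact; whether the paper treats interior stopwords this way is not stated in the abstract.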
Similar resources
Multiword Expression Filtering For Building Knowledge
This paper describes an algorithm that can be used to improve the quality of multiword expressions extracted from documents. We measure multiword expression quality by the “usefulness” of a multiword expression in helping ontologists build knowledge maps that allow users to search a large document corpus. Our stopword-based algorithm takes n-grams extracted from documents, and cleans them up to ...
Automatic Discovery And Aggregation Of Compound Names For The Use In Knowledge Representations
Automatic acquisition of information structures like Topic Maps or semantic networks from large document collections is an important issue in knowledge management. An inherent problem with automatic approaches is the treatment of multiword terms as single semantic entities. Taking company names as an example, we present a method for learning multiword terms from large text corpora expl...
Automatic Discovery and Aggregation of Compound Names for the Use in Knowledge Representations
Automatic acquisition of information structures like Topic Maps or semantic networks from large document collections is an important issue in knowledge management. An inherent problem with automatic approaches is the treatment of multiword terms as single semantic entities. Taking company names as an example, we present a method for learning multiword terms from large text corpora exploiting th...
Acquiring Translation Equivalences of Multiword Expressions by Normalized Correlation Frequencies
In this paper, we present an algorithm for extracting translations of any given multiword expression from parallel corpora. Given a multiword expression to be translated, the method involves extracting a short list of target candidate words from parallel corpora based on scores of normalized frequency, generating possible translations and filtering out common subsequences, and selecting the top...
Multiword Sequences as Building Blocks for Language: Insights into First and Second Language Learning
Many grammatical frameworks view words and rules as the basic building blocks of language, with multiword sequences being treated as peripheral exceptions in the form of idioms, etc. (e.g., Pinker, 1999). The new millennium, however, has seen a shift toward construing multiword sequences not as linguistic rarities but as important building blocks for language acquisition and processing. Based o...